This tutorial demonstrates how to use R to perform the data analytics tasks covered in Week 02: variable types and measurement levels, descriptive statistics, data visualization, regression modelling, and hypothesis testing.
# Install packages if needed (uncomment to install)
# install.packages(c("tidyverse", "ggplot2", "dplyr", "knitr", "kableExtra"))
# install.packages("seedhash") # For reproducible seed generation
# Load required packages
library(tidyverse) # Data manipulation and visualization
library(ggplot2) # Advanced plotting
library(dplyr) # Data manipulation
library(knitr) # Table formatting
library(kableExtra) # Enhanced tables
library(seedhash) # Reproducible seed generation
# Display seed information
cat("\n=== REPRODUCIBLE SEED INFORMATION ===")
##
## === REPRODUCIBLE SEED INFORMATION ===
##
## Generator Name: A dog is a man's best friend
##
## MD5 Hash: 0e112cb70d324ea5e5b9f803b4488bde
##
## Available Seeds: -805293604, -909845485, -339394968, 586956699, -960637137 ...
R handles different types of variables. Let’s explore them:
Key term – measurement level: Field, Miles, and Field (2012, ch. 2) emphasize that correctly classifying each variable’s measurement level determines which descriptive summaries and hypothesis tests are valid. Treating an ordinal scale (for example, satisfaction ratings) as if it were interval can distort estimates because the distance between categories is not guaranteed to be equal (Field et al., 2012).
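A small illustration of the risk (the ratings vector below is invented for demonstration): coding an ordinal scale as 1–5 and averaging assumes the step between adjacent categories is always the same, while the median relies only on rank order.

```r
# Hypothetical 5-point satisfaction ratings (ordinal codes, not true numbers)
ratings <- c(1, 1, 2, 5, 5, 5, 5)

# The mean assumes the step from 1 to 2 equals the step from 4 to 5
round(mean(ratings), 2)  # 3.43

# The median uses only rank order, so it is safe for ordinal data
median(ratings)          # 5
```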
# Numeric (continuous) variables
age <- 25
height <- 175.5
temperature <- 98.6
# Integer variables
num_students <- 30L
year <- 2025L
# Character (string) variables
name <- "John Doe"
city <- "Washington DC"
# Logical (boolean) variables
is_student <- TRUE
passed_exam <- FALSE
# Factor (categorical) variables
education_level <- factor(c("High School", "Bachelor", "Master", "PhD"))
grade <- factor(c("A", "B", "C", "D", "F"), ordered = TRUE)
# Check variable types
cat("Type of age:", class(age), "\n")
cat("Type of name:", class(name), "\n")
cat("Type of is_student:", class(is_student), "\n")
cat("Type of education_level:", class(education_level), "\n")
## Type of age: numeric
## Type of name: character
## Type of is_student: logical
## Type of education_level: factor
Categorical variables capture group membership. Field et al. (2012, ch. 2) note that we summarize them with counts or proportions and we model them with tests designed for frequencies (such as the chi-square test introduced later in Part 5).
Binary (or dichotomous) variables only have two possible outcomes (Field et al., 2012). Because every case must fall into exactly one category, proportions are a natural summary and form the foundation for later logistic models.
# Binary variable example: Pass/Fail
pass_fail <- c("Pass", "Fail", "Pass", "Pass", "Fail")
pass_fail_factor <- factor(pass_fail)
cat("Binary variable levels:", levels(pass_fail_factor), "\n")
## Binary variable levels: Fail Pass
table(pass_fail_factor)
## pass_fail_factor
## Fail Pass
## 2 3
Nominal scales label unordered categories such as department names. Field et al. (2012) highlight that although the numbers we assign to these categories are arbitrary, the counts still convey how often each outcome appears.
# Nominal: Categories without order
departments <- factor(c("HR", "IT", "Finance", "Marketing", "IT", "HR", "Finance"))
cat("Department frequency:\n")
## Department frequency:
table(departments)
## departments
## Finance HR IT Marketing
## 2 2 2 1
# Visualize
barplot(table(departments),
main = "Department Distribution",
col = "steelblue",
ylab = "Frequency",
las = 1)
Ordinal variables preserve order but not equal spacing between categories (Field et al., 2012). Treat medians or percentiles as your primary summaries because means can hide the unequal steps.
# Ordinal: Categories with logical order
satisfaction <- factor(
c("Very Unsatisfied", "Unsatisfied", "Neutral", "Satisfied", "Very Satisfied",
"Satisfied", "Neutral", "Very Satisfied", "Satisfied", "Unsatisfied"),
levels = c("Very Unsatisfied", "Unsatisfied", "Neutral", "Satisfied", "Very Satisfied"),
ordered = TRUE
)
cat("Satisfaction levels:\n")
## Satisfaction levels:
table(satisfaction)
## satisfaction
## Very Unsatisfied Unsatisfied Neutral Satisfied
## 1 2 2 3
## Very Satisfied
## 2
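Because `median()` refuses factors, a common workaround (a sketch, not part of the original tutorial; the factor is restated here so the block is self-contained) is to take the median of the underlying integer codes and map back to a label. With an even sample size the two middle codes can straddle categories, so this version reports the lower one.

```r
satisfaction <- factor(
  c("Very Unsatisfied", "Unsatisfied", "Neutral", "Satisfied", "Very Satisfied",
    "Satisfied", "Neutral", "Very Satisfied", "Satisfied", "Unsatisfied"),
  levels = c("Very Unsatisfied", "Unsatisfied", "Neutral", "Satisfied", "Very Satisfied"),
  ordered = TRUE
)

# Median of the integer codes; floor() picks the lower middle category
median_code <- floor(median(as.integer(satisfaction)))
levels(satisfaction)[median_code]  # "Neutral"
```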
# Visualize with ordered categories
barplot(table(satisfaction),
main = "Customer Satisfaction Survey",
col = rainbow(5),
ylab = "Count",
las = 2,
cex.names = 0.8)
Continuous variables have meaningful numeric scales, allowing arithmetic operations. Field et al. (2012) distinguish between interval scales (equal steps without a true zero) and ratio scales (equal steps with a meaningful zero), which guides whether ratios like “twice as much” are interpretable.
# Interval: Equal intervals but no true zero (e.g., temperature in Celsius)
temperatures <- c(15, 18, 22, 25, 20, 16, 19, 23, 21, 17)
cat("Temperature Statistics:\n")
cat("Mean:", mean(temperatures), "°C\n")
cat("Median:", median(temperatures), "°C\n")
cat("Range:", min(temperatures), "to", max(temperatures), "°C\n")
## Temperature Statistics:
## Mean: 19.6 °C
## Median: 19.5 °C
## Range: 15 to 25 °C
# Note: 0°C doesn't mean "no temperature"
hist(temperatures,
main = "Temperature Distribution",
xlab = "Temperature (°C)",
col = "lightblue",
border = "white")
# Ratio: Equal intervals AND true zero (e.g., height, weight, income)
heights <- c(160, 175, 182, 168, 155, 190, 172, 165, 178, 185)
cat("Height Statistics:\n")
cat("Mean:", mean(heights), "cm\n")
cat("Median:", median(heights), "cm\n")
## Height Statistics:
## Mean: 173 cm
## Median: 173.5 cm
## Note: 0 cm would mean no height (true zero)
# Ratio calculations make sense
cat("\nRatio example: 180cm is", 180/90, "times taller than 90cm\n")
##
## Ratio example: 180cm is 2 times taller than 90cm
Let’s work with built-in R datasets and real examples:
# Load built-in datasets
data(iris) # Famous flower dataset
data(mtcars) # Motor cars data
data(airquality) # Air quality measurements
# Preview datasets
cat("Iris dataset (first 6 rows):\n")
## Iris dataset (first 6 rows):
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |
Field et al. (2012, ch. 4) describe the mean as the balance point of a distribution: the point at which the total distance of scores above equals the total distance below. Because it uses every score, the mean is efficient but sensitive to outliers, so always pair it with a visualization to check for extreme values.
# Calculate mean
sepal_lengths <- iris$Sepal.Length
mean_value <- mean(sepal_lengths)
cat("Mean Sepal Length:", round(mean_value, 2), "cm\n")
## Mean Sepal Length: 5.84 cm
# Visualize with mean line
hist(sepal_lengths,
main = "Sepal Length Distribution with Mean",
xlab = "Sepal Length (cm)",
col = "lightgreen",
border = "white")
abline(v = mean_value, col = "red", lwd = 3, lty = 2)
legend("topright", legend = paste("Mean =", round(mean_value, 2)),
col = "red", lty = 2, lwd = 3)
Understanding the Mean:
\[\bar{X} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + ... + x_n}{n}\]
# Manual calculation to understand the formula
sum_values <- sum(sepal_lengths)
n_values <- length(sepal_lengths)
manual_mean <- sum_values / n_values
cat("Sum of values:", sum_values, "\n")
cat("Number of values:", n_values, "\n")
cat("Manual mean calculation:", round(manual_mean, 2), "\n")
cat("R's mean() function:", round(mean(sepal_lengths), 2), "\n")
## Sum of values: 876.5
## Number of values: 150
## Manual mean calculation: 5.84
## R's mean() function: 5.84
The median splits the ordered data in half. Field et al. (2012) note that because it is based on rank rather than magnitude, the median resists the influence of extreme scores and is a better summary than the mean for skewed or ordinal data.
# Calculate median
median_value <- median(sepal_lengths)
cat("Median Sepal Length:", round(median_value, 2), "cm\n")
cat("Mean Sepal Length:", round(mean_value, 2), "cm\n")
cat("Difference:", round(mean_value - median_value, 3), "cm\n")
## Median Sepal Length: 5.8 cm
## Mean Sepal Length: 5.84 cm
## Difference: 0.043 cm
# Visualize both
hist(sepal_lengths,
main = "Sepal Length: Mean vs Median",
xlab = "Sepal Length (cm)",
col = "skyblue",
border = "white")
abline(v = mean_value, col = "red", lwd = 3, lty = 2)
abline(v = median_value, col = "blue", lwd = 3, lty = 2)
legend("topright",
legend = c(paste("Mean =", round(mean_value, 2)),
paste("Median =", round(median_value, 2))),
col = c("red", "blue"),
lty = 2,
lwd = 3)
Finding the Median:
- For an odd number of values: the median is the middle value
- For an even number of values: the median is the average of the two middle values
# Example with manual calculation
facebook_friends <- c(22, 40, 53, 57, 93, 98, 103, 108, 116, 121, 252)
sorted_friends <- sort(facebook_friends)
cat("Sorted values:", sorted_friends, "\n")
## Sorted values: 22 40 53 57 93 98 103 108 116 121 252
## Position of median: (n+1)/2 = (11+1)/2 = 6
## Median value: 98
# Remove extreme value
friends_no_extreme <- facebook_friends[facebook_friends != 252]
cat("\nWithout extreme value (252):\n")
##
## Without extreme value (252):
## New median: 95.5
## Change: 2.5
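The facebook_friends example has an odd n; for completeness, a small sketch (made-up scores) of the even-n case, where the median averages the two middle values:

```r
# Even number of values: average the two middle values
scores <- sort(c(40, 12, 31, 25, 52, 18))  # 12 18 25 31 40 52
middle_two <- scores[c(3, 4)]              # positions n/2 and n/2 + 1
manual_median <- mean(middle_two)          # (25 + 31) / 2 = 28

manual_median   # 28
median(scores)  # 28 -- agrees with the manual calculation
```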
The mode identifies the most frequently occurring category or value. Field et al. (2012) recommend reporting it alongside the median for ordinal data because it highlights the most common response option.
# R doesn't have a built-in mode function, so we create one
get_mode <- function(x) {
unique_vals <- unique(x)
freq_table <- tabulate(match(x, unique_vals))
unique_vals[which.max(freq_table)]
}
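One caveat: `get_mode()` returns only the first value at the maximum frequency, so a tied second mode is silently dropped. A variant that reports every tied mode (an illustrative helper, `get_modes` is not part of the original code):

```r
get_modes <- function(x) {
  freq <- table(x)
  names(freq)[freq == max(freq)]  # keep all values tied at the maximum count
}

get_modes(c("A", "B", "B", "A", "C"))  # "A" "B" -- both appear twice
```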
# Example with discrete data
grades <- c("A", "B", "B", "C", "B", "A", "C", "B", "D", "B", "C")
mode_grade <- get_mode(grades)
cat("Mode (most frequent grade):", mode_grade, "\n")
cat("\nFrequency table:\n")
table(grades)
## Mode (most frequent grade): B
##
## Frequency table:
## grades
## A B C D
## 2 5 3 1
# Visualize
barplot(table(grades),
main = "Grade Distribution (Mode in Red)",
col = ifelse(names(table(grades)) == mode_grade, "red", "steelblue"),
ylab = "Frequency")
# Create a comparison table
iris_summary <- iris %>%
group_by(Species) %>%
summarise(
Mean = mean(Sepal.Length),
Median = median(Sepal.Length),
Min = min(Sepal.Length),
Max = max(Sepal.Length),
N = n()
) %>%
mutate(across(where(is.numeric), ~round(., 2)))
iris_summary %>%
kable(caption = "Central Tendency by Species") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Species | Mean | Median | Min | Max | N |
|---|---|---|---|---|---|
| setosa | 5.01 | 5.0 | 4.3 | 5.8 | 50 |
| versicolor | 5.94 | 5.9 | 4.9 | 7.0 | 50 |
| virginica | 6.59 | 6.5 | 4.9 | 7.9 | 50 |
The range communicates the full spread of the data (max − min). Field et al. (2012) point out that it is easy to compute yet heavily influenced by a single extreme value, so treat it as a quick diagnostic rather than a robust statistic.
# Calculate range
range_vals <- range(sepal_lengths)
range_value <- diff(range_vals)
cat("Minimum:", range_vals[1], "\n")
cat("Maximum:", range_vals[2], "\n")
cat("Range:", range_value, "\n")
## Minimum: 4.3
## Maximum: 7.9
## Range: 3.6
##
## Effect of extreme values:
## Original range: 230
## Without extreme (252): 99
## Reduction: 131
Field et al. (2012) advocate using the interquartile range when you want the spread of the middle 50% of scores. Because it ignores the most extreme quartiles, the IQR pairs naturally with the median for skewed distributions.
# Calculate quartiles
q1 <- quantile(sepal_lengths, 0.25)
q2 <- quantile(sepal_lengths, 0.50) # This is the median
q3 <- quantile(sepal_lengths, 0.75)
iqr_value <- IQR(sepal_lengths)
cat("Lower Quartile (Q1):", q1, "\n")
cat("Median (Q2):", q2, "\n")
cat("Upper Quartile (Q3):", q3, "\n")
cat("Interquartile Range (IQR):", iqr_value, "\n")
## Lower Quartile (Q1): 5.1
## Median (Q2): 5.8
## Upper Quartile (Q3): 6.4
## Interquartile Range (IQR): 1.3
# Boxplot shows quartiles
boxplot(sepal_lengths,
main = "Boxplot Showing Quartiles",
ylab = "Sepal Length (cm)",
col = "lightblue",
horizontal = TRUE)
text(x = q1, y = 1.3, labels = "Q1", col = "red", cex = 1.2)
text(x = q2, y = 1.3, labels = "Q2 (Median)", col = "red", cex = 1.2)
text(x = q3, y = 1.3, labels = "Q3", col = "red", cex = 1.2)
Variance averages the squared deviations from the mean, while the standard deviation returns to the original units by taking the square root (Field et al., 2012). These statistics assume an interval or ratio scale and provide the backbone for inferential models, including the t tests in Part 5.
# Calculate variance and standard deviation
var_value <- var(sepal_lengths)
sd_value <- sd(sepal_lengths)
cat("Variance:", round(var_value, 4), "\n")
cat("Standard Deviation:", round(sd_value, 4), "\n")
## Variance: 0.6857
## Standard Deviation: 0.8281
##
## Interpretation:
## On average, sepal lengths deviate by 0.83 cm from the mean.
# Visualize with standard deviation bands
mean_val <- mean(sepal_lengths)
hist(sepal_lengths,
main = "Distribution with Standard Deviation",
xlab = "Sepal Length (cm)",
col = "lavender",
border = "white")
abline(v = mean_val, col = "red", lwd = 2)
abline(v = mean_val - sd_value, col = "blue", lwd = 2, lty = 2)
abline(v = mean_val + sd_value, col = "blue", lwd = 2, lty = 2)
legend("topright",
legend = c("Mean", "± 1 SD"),
col = c("red", "blue"),
lty = c(1, 2),
lwd = 2)
Understanding Variance and Standard Deviation:
\[s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}\]
\[s = \sqrt{s^2}\]
# Manual calculation to understand the formula
deviations <- sepal_lengths - mean(sepal_lengths)
squared_deviations <- deviations^2
sum_squared_dev <- sum(squared_deviations)
manual_variance <- sum_squared_dev / (length(sepal_lengths) - 1)
manual_sd <- sqrt(manual_variance)
cat("Manual variance:", round(manual_variance, 4), "\n")
cat("R's var():", round(var(sepal_lengths), 4), "\n")
cat("Manual SD:", round(manual_sd, 4), "\n")
cat("R's sd():", round(sd(sepal_lengths), 4), "\n")
## Manual variance: 0.6857
## R's var(): 0.6857
## Manual SD: 0.8281
## R's sd(): 0.8281
# Comprehensive dispersion measures
dispersion_summary <- data.frame(
Measure = c("Range", "IQR", "Variance", "Std Dev", "Coef of Variation"),
Value = c(
diff(range(sepal_lengths)),
IQR(sepal_lengths),
var(sepal_lengths),
sd(sepal_lengths),
sd(sepal_lengths) / mean(sepal_lengths) * 100
),
Unit = c("cm", "cm", "cm²", "cm", "%")
)
dispersion_summary %>%
mutate(Value = round(Value, 3)) %>%
kable(caption = "Measures of Dispersion for Sepal Length") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Measure | Value | Unit |
|---|---|---|
| Range | 3.600 | cm |
| IQR | 1.300 | cm |
| Variance | 0.686 | cm² |
| Std Dev | 0.828 | cm |
| Coef of Variation | 14.171 | % |
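The last row deserves a note: the coefficient of variation (SD divided by the mean, times 100) is unit-free, so it can compare relative spread across variables measured on different scales, which raw standard deviations cannot. A sketch using two iris columns:

```r
data(iris)
cv <- function(x) sd(x) / mean(x) * 100

# Petal length varies far more relative to its mean than sepal length does
round(cv(iris$Sepal.Length), 1)  # 14.2
round(cv(iris$Petal.Length), 1)  # 47.0
```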
# Create frequency table
sepal_length_cut <- cut(sepal_lengths, breaks = 8)
freq_table <- table(sepal_length_cut)
cat("Frequency Distribution:\n")
freq_table
## Frequency Distribution:
## sepal_length_cut
## (4.3,4.75] (4.75,5.2] (5.2,5.65] (5.65,6.1] (6.1,6.55] (6.55,7] (7,7.45]
## 11 34 20 30 25 18 6
## (7.45,7.9]
## 6
# Create histogram with frequency
hist(sepal_lengths,
main = "Frequency Distribution of Sepal Length",
xlab = "Sepal Length (cm)",
ylab = "Frequency",
col = "steelblue",
border = "white",
breaks = 12)
# Add density curve
lines(density(sepal_lengths), col = "red", lwd = 3)
legend("topright", legend = "Density Curve", col = "red", lwd = 3)
# Create different distribution shapes
par(mfrow = c(2, 2))
# Normal distribution
normal_data <- rnorm(1000, mean = 50, sd = 10)
hist(normal_data, main = "Normal Distribution",
col = "lightgreen", border = "white", xlab = "Values")
# Right-skewed (positive skew)
skewed_right <- rexp(1000, rate = 0.5)
hist(skewed_right, main = "Right Skewed (Positive)",
col = "lightcoral", border = "white", xlab = "Values")
# Left-skewed (negative skew)
skewed_left <- -rexp(1000, rate = 0.5)
hist(skewed_left, main = "Left Skewed (Negative)",
col = "lightblue", border = "white", xlab = "Values")
# Uniform distribution
uniform_data <- runif(1000, min = 0, max = 100)
hist(uniform_data, main = "Uniform Distribution",
col = "lavender", border = "white", xlab = "Values")
library(moments) # For skewness and kurtosis
# Calculate shape statistics
skewness_val <- skewness(sepal_lengths)
kurtosis_val <- kurtosis(sepal_lengths)
cat("Skewness:", round(skewness_val, 3), "\n")
## Skewness: 0.312
cat("Interpretation:",
ifelse(abs(skewness_val) < 0.5, "Approximately Symmetric",
ifelse(skewness_val > 0, "Right-skewed (Positive)", "Left-skewed (Negative)")), "\n\n")
## Interpretation: Approximately Symmetric
cat("Kurtosis:", round(kurtosis_val, 3), "\n")
## Kurtosis: 2.426
cat("Interpretation:",
ifelse(kurtosis_val > 3, "Leptokurtic (heavy tails)",
ifelse(kurtosis_val < 3, "Platykurtic (light tails)", "Mesokurtic (normal)")), "\n")
## Interpretation: Platykurtic (light tails)
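To connect the number to the formula: `moments::skewness()` is the third standardized moment, m₃ / m₂^(3/2), with n (not n − 1) in the denominators. A manual sketch, assuming that definition:

```r
data(iris)
x <- iris$Sepal.Length
n <- length(x)

m2 <- sum((x - mean(x))^2) / n  # second central moment
m3 <- sum((x - mean(x))^3) / n  # third central moment

manual_skewness <- m3 / m2^1.5
round(manual_skewness, 3)       # 0.312, matching the skewness value above
```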
cat("=== IRIS DATASET SUMMARY ===\n")
summary(iris)
## === IRIS DATASET SUMMARY ===
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
# More detailed summary by species
iris_detailed <- iris %>%
group_by(Species) %>%
summarise(
N = n(),
Mean_Sepal_Length = mean(Sepal.Length),
SD_Sepal_Length = sd(Sepal.Length),
Mean_Sepal_Width = mean(Sepal.Width),
SD_Sepal_Width = sd(Sepal.Width),
Mean_Petal_Length = mean(Petal.Length),
SD_Petal_Length = sd(Petal.Length)
) %>%
mutate(across(where(is.numeric), ~round(., 2)))
iris_detailed %>%
kable(caption = "Detailed Summary by Species") %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
scroll_box(width = "100%")
| Species | N | Mean_Sepal_Length | SD_Sepal_Length | Mean_Sepal_Width | SD_Sepal_Width | Mean_Petal_Length | SD_Petal_Length |
|---|---|---|---|---|---|---|---|
| setosa | 50 | 5.01 | 0.35 | 3.43 | 0.38 | 1.46 | 0.17 |
| versicolor | 50 | 5.94 | 0.52 | 2.77 | 0.31 | 4.26 | 0.47 |
| virginica | 50 | 6.59 | 0.64 | 2.97 | 0.32 | 5.55 | 0.55 |
# Multiple ways to visualize single variable
par(mfrow = c(2, 2))
# Histogram
hist(iris$Sepal.Length,
main = "Histogram",
xlab = "Sepal Length (cm)",
col = "steelblue",
border = "white")
# Density plot
plot(density(iris$Sepal.Length),
main = "Density Plot",
xlab = "Sepal Length (cm)",
lwd = 2,
col = "darkblue")
polygon(density(iris$Sepal.Length), col = rgb(0, 0, 1, 0.3))
# Boxplot
boxplot(iris$Sepal.Length,
main = "Boxplot",
ylab = "Sepal Length (cm)",
col = "lightgreen")
# Q-Q plot (check normality)
qqnorm(iris$Sepal.Length, main = "Q-Q Plot")
qqline(iris$Sepal.Length, col = "red", lwd = 2)
# Relationship between two variables
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point(size = 3, alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Sepal Length vs Sepal Width by Species",
x = "Sepal Length (cm)",
y = "Sepal Width (cm)"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
# Pairs plot for multiple variables
pairs(iris[, 1:4],
main = "Scatterplot Matrix of Iris Dataset",
pch = 21,
bg = c("red", "green", "blue")[unclass(iris$Species)])
# Faceted visualization
ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
geom_histogram(bins = 20, alpha = 0.7, color = "white") +
facet_wrap(~Species, ncol = 1) +
labs(
title = "Sepal Length Distribution by Species",
x = "Sepal Length (cm)",
y = "Frequency"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none"
)
cat("=== MISSING VALUES CHECK ===\n")
## === MISSING VALUES CHECK ===
missing_summary <- data.frame(
Variable = names(iris),
Missing_Count = colSums(is.na(iris)),
Missing_Percent = round(colSums(is.na(iris)) / nrow(iris) * 100, 2)
)
missing_summary %>%
kable(caption = "Missing Values Summary") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | Variable | Missing_Count | Missing_Percent |
|---|---|---|---|
| Sepal.Length | Sepal.Length | 0 | 0 |
| Sepal.Width | Sepal.Width | 0 | 0 |
| Petal.Length | Petal.Length | 0 | 0 |
| Petal.Width | Petal.Width | 0 | 0 |
| Species | Species | 0 | 0 |
# Check for outliers using IQR method
detect_outliers <- function(x) {
q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- q3 - q1
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
sum(x < lower_bound | x > upper_bound)
}
cat("\n=== OUTLIER DETECTION ===\n")
##
## === OUTLIER DETECTION ===
outlier_summary <- iris %>%
select(where(is.numeric)) %>%
summarise(across(everything(), detect_outliers)) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Outlier_Count")
outlier_summary %>%
kable(caption = "Outlier Count by Variable (IQR Method)") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Variable | Outlier_Count |
|---|---|
| Sepal.Length | 0 |
| Sepal.Width | 4 |
| Petal.Length | 0 |
| Petal.Width | 0 |
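`detect_outliers()` only counts; when cleaning data you usually also want to see which rows are flagged. A variant returning a logical mask (illustrative, `flag_outliers` is not in the original):

```r
flag_outliers <- function(x) {
  q1 <- quantile(x, 0.25)
  q3 <- quantile(x, 0.75)
  iqr <- q3 - q1
  x < q1 - 1.5 * iqr | x > q3 + 1.5 * iqr  # TRUE marks an outlier
}

data(iris)
# Inspect the four Sepal.Width outliers counted in the table above
iris[flag_outliers(iris$Sepal.Width), c("Sepal.Width", "Species")]
```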
# Build a simple linear regression model
# Predict Sepal Width from Sepal Length
model_simple <- lm(Sepal.Width ~ Sepal.Length, data = iris)
# View model summary
summary(model_simple)
##
## Call:
## lm(formula = Sepal.Width ~ Sepal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1095 -0.2454 -0.0167 0.2763 1.3338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.41895 0.25356 13.48 <2e-16 ***
## Sepal.Length -0.06188 0.04297 -1.44 0.152
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4343 on 148 degrees of freedom
## Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
## F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
# Extract key statistics
r_squared <- summary(model_simple)$r.squared
adj_r_squared <- summary(model_simple)$adj.r.squared
coef_intercept <- coef(model_simple)[1]
coef_slope <- coef(model_simple)[2]
cat("\n=== MODEL INTERPRETATION ===\n")
##
## === MODEL INTERPRETATION ===
cat("Equation: Sepal.Width =", round(coef_intercept, 3), "+",
round(coef_slope, 3), "× Sepal.Length\n")
cat("R-squared:", round(r_squared, 4), "\n")
cat("Adjusted R-squared:", round(adj_r_squared, 4), "\n")
## Equation: Sepal.Width = 3.419 + -0.062 × Sepal.Length
## R-squared: 0.0138
## Adjusted R-squared: 0.0072
# Plot with regression line
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species), size = 3, alpha = 0.6) +
geom_smooth(method = "lm", se = TRUE, color = "blue", fill = "lightblue") +
labs(
title = "Simple Linear Regression: Sepal Width vs Sepal Length",
subtitle = paste("R² =", round(r_squared, 3)),
x = "Sepal Length (cm)",
y = "Sepal Width (cm)"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
##
## === DIAGNOSTIC INTERPRETATION ===
## 1. Residuals vs Fitted: Check for linearity and homoscedasticity
## 2. Q-Q Plot: Check if residuals are normally distributed
## 3. Scale-Location: Check homoscedasticity
## 4. Residuals vs Leverage: Identify influential points
# Build multiple regression model
# Predict Petal Length from multiple predictors
model_multiple <- lm(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width,
data = iris)
# View summary
summary(model_multiple)
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width,
## data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.99333 -0.17656 -0.01004 0.18558 1.06909
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.26271 0.29741 -0.883 0.379
## Sepal.Length 0.72914 0.05832 12.502 <2e-16 ***
## Sepal.Width -0.64601 0.06850 -9.431 <2e-16 ***
## Petal.Width 1.44679 0.06761 21.399 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.319 on 146 degrees of freedom
## Multiple R-squared: 0.968, Adjusted R-squared: 0.9674
## F-statistic: 1473 on 3 and 146 DF, p-value: < 2.2e-16
##
## === MODEL COMPARISON ===
comparison <- data.frame(
Model = c("Simple Regression", "Multiple Regression"),
Predictors = c(1, 3),
R_squared = c(
summary(model_simple)$r.squared,
summary(model_multiple)$r.squared
),
Adj_R_squared = c(
summary(model_simple)$adj.r.squared,
summary(model_multiple)$adj.r.squared
),
RMSE = c(
sqrt(mean(model_simple$residuals^2)),
sqrt(mean(model_multiple$residuals^2))
)
)
comparison %>%
mutate(across(where(is.numeric), ~round(., 4))) %>%
kable(caption = "Model Performance Comparison") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Model | Predictors | R_squared | Adj_R_squared | RMSE |
|---|---|---|---|---|
| Simple Regression | 1 | 0.0138 | 0.0072 | 0.4314 |
| Multiple Regression | 3 | 0.9680 | 0.9674 | 0.3147 |
# Create new data for predictions
new_data <- data.frame(
Sepal.Length = c(5.0, 6.0, 7.0),
Sepal.Width = c(3.0, 3.5, 3.2),
Petal.Width = c(1.5, 2.0, 2.5)
)
# Make predictions
predictions <- predict(model_multiple, newdata = new_data, interval = "prediction")
# Combine with input data
prediction_results <- cbind(new_data, predictions) %>%
mutate(across(where(is.numeric), ~round(., 2)))
prediction_results %>%
kable(caption = "Predictions with 95% Prediction Intervals") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Sepal.Length | Sepal.Width | Petal.Width | fit | lwr | upr |
|---|---|---|---|---|---|
| 5 | 3.0 | 1.5 | 3.62 | 2.97 | 4.26 |
| 6 | 3.5 | 2.0 | 4.74 | 4.10 | 5.39 |
| 7 | 3.2 | 2.5 | 6.39 | 5.75 | 7.03 |
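Note that `interval = "prediction"` produces prediction intervals for individual new observations; `interval = "confidence"` gives narrower intervals for the mean response at those predictor values. A sketch of the contrast:

```r
data(iris)
model <- lm(Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width, data = iris)
new_obs <- data.frame(Sepal.Length = 6.0, Sepal.Width = 3.5, Petal.Width = 2.0)

ci <- predict(model, newdata = new_obs, interval = "confidence")  # mean response
pi <- predict(model, newdata = new_obs, interval = "prediction")  # one new flower

# The prediction interval is wider: it adds the residual variance of
# individual observations to the uncertainty in the fitted mean
ci
pi
```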
A hypothesis is a prediction from a theory that can be tested through data collection and analysis. It’s a specific, testable statement about the relationship between variables.
Key Characteristics of a Good Hypothesis:
| Type | Symbol | Description | Example |
|---|---|---|---|
| Alternative Hypothesis (H₁) | H₁ or Hₐ | States that an effect WILL occur; the research prediction | Manual transmission cars have higher MPG than automatic cars |
| Null Hypothesis (H₀) | H₀ | States that NO effect will occur; opposite of alternative hypothesis | Manual and automatic cars have the same MPG |
| Directional Hypothesis | H₁ (one-tailed) | Predicts the DIRECTION of the effect (e.g., ‘greater than’, ‘less than’) | Manual cars have HIGHER MPG than automatic cars |
| Non-directional Hypothesis | H₁ (two-tailed) | Predicts an effect exists but NOT its direction (e.g., ‘different from’) | Manual and automatic cars have DIFFERENT MPG |
Important Concept: We can never prove the alternative hypothesis using statistics. We can only reject the null hypothesis (when the observed data would be unlikely under it) or fail to reject it; neither outcome proves a hypothesis true.
“We cannot talk about the null hypothesis being true or the experimental hypothesis being true, we can only talk in terms of the probability of obtaining a particular set of data if, hypothetically speaking, the null hypothesis was true.”
— Field, Miles, & Field (2012)
Figure: Hypothesis Testing Logic Flow
The p-value is the probability of obtaining your observed data (or more extreme) assuming the null hypothesis is true.
Field et al. (2012, ch. 9) stress that this probability is conditional: it tells us how compatible the sample is with H₀ rather than the probability that H₀ itself is true. Always interpret the p-value alongside effect sizes and research context instead of a mechanical “significant/not significant” decision (Field et al., 2012).
Interpretation:
- Small p-value (typically < 0.05): the data are unlikely under H₀ → reject H₀
- Large p-value (≥ 0.05): the data are plausible under H₀ → fail to reject H₀
Common Significance Levels:
- α = 0.05 (5%): most common in the social sciences
- α = 0.01 (1%): more stringent
- α = 0.10 (10%): more lenient (exploratory research)
Research Question: Is the average sepal length of iris flowers different from 6 cm?
Hypotheses:
- H₀: μ = 6 (the population mean equals 6)
- H₁: μ ≠ 6 (the population mean is different from 6) [two-tailed]
# One-sample t-test
# Test if mean sepal length differs from 6 cm
test_value <- 6
# Perform the test
result_onesample <- t.test(iris$Sepal.Length, mu = test_value)
cat("=== ONE-SAMPLE T-TEST ===\n\n")
## === ONE-SAMPLE T-TEST ===
## Research Question: Is mean sepal length different from 6 cm?
## H₀: μ = 6
## H₁: μ ≠ 6
##
## One Sample t-test
##
## data: iris$Sepal.Length
## t = -2.3172, df = 149, p-value = 0.02186
## alternative hypothesis: true mean is not equal to 6
## 95 percent confidence interval:
## 5.709732 5.976934
## sample estimates:
## mean of x
## 5.843333
##
## === INTERPRETATION ===
## Sample mean: 5.843 cm
## t-statistic: -2.317
## p-value: 0.02186
cat("95% CI: [", round(result_onesample$conf.int[1], 3), ",",
round(result_onesample$conf.int[2], 3), "]\n\n")
## 95% CI: [ 5.71 , 5.977 ]
if(result_onesample$p.value < 0.05) {
cat("Decision: REJECT H₀ (p < 0.05)\n")
cat("Conclusion: The mean sepal length is significantly different from", test_value, "cm.\n")
} else {
cat("Decision: FAIL TO REJECT H₀ (p ≥ 0.05)\n")
cat("Conclusion: Insufficient evidence that mean differs from", test_value, "cm.\n")
}
## Decision: REJECT H₀ (p < 0.05)
## Conclusion: The mean sepal length is significantly different from 6 cm.
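The reported statistic can be reproduced by hand from the one-sample t formula, t = (x̄ − μ₀) / (s / √n), with the p-value from the t distribution on n − 1 degrees of freedom:

```r
data(iris)
x <- iris$Sepal.Length
mu0 <- 6

se <- sd(x) / sqrt(length(x))                            # standard error of the mean
t_manual <- (mean(x) - mu0) / se
p_manual <- 2 * pt(-abs(t_manual), df = length(x) - 1)   # two-tailed

round(t_manual, 4)  # -2.3172, matching t.test() above
round(p_manual, 5)  # 0.02186
```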
# Visualize
hist(iris$Sepal.Length,
main = "Sepal Length Distribution with Test Value",
xlab = "Sepal Length (cm)",
col = "lightblue",
border = "white",
breaks = 15)
abline(v = mean(iris$Sepal.Length), col = "red", lwd = 3, lty = 1)
abline(v = test_value, col = "blue", lwd = 3, lty = 2)
legend("topright",
legend = c(paste("Sample Mean =", round(mean(iris$Sepal.Length), 2)),
paste("Test Value =", test_value)),
col = c("red", "blue"),
lty = c(1, 2),
lwd = 3)
Research Question: Do manual transmission cars have different fuel efficiency than automatic transmission cars?
Hypotheses:
- H₀: μ_manual = μ_automatic (no difference in mean MPG)
- H₁: μ_manual ≠ μ_automatic (means are different) [two-tailed]
# Independent samples t-test
# Compare MPG between automatic and manual transmission
# Prepare data
automatic <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]
cat("=== INDEPENDENT TWO-SAMPLE T-TEST ===\n\n")
## === INDEPENDENT TWO-SAMPLE T-TEST ===
## Research Question: Do manual and automatic cars differ in fuel efficiency?
## H₀: μ_manual = μ_automatic
## H₁: μ_manual ≠ μ_automatic
# Perform t-test
result_independent <- t.test(manual, automatic, var.equal = FALSE)
print(result_independent)
##
## Welch Two Sample t-test
##
## data: manual and automatic
## t = 3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.209684 11.280194
## sample estimates:
## mean of x mean of y
## 24.39231 17.14737
##
## === DESCRIPTIVE STATISTICS ===
## Automatic transmission:
## n = 19
## Mean = 17.15 mpg
## SD = 3.83
## Manual transmission:
## n = 13
## Mean = 24.39 mpg
## SD = 6.17
## === INTERPRETATION ===
## Difference in means: 7.24 mpg
## t-statistic: 3.767
## p-value: 0.001374
if(result_independent$p.value < 0.05) {
cat("Decision: REJECT H₀ (p < 0.05)\n")
cat("Conclusion: Manual and automatic transmissions have significantly different MPG.\n")
cat("Manual transmission cars have",
ifelse(mean(manual) > mean(automatic), "HIGHER", "LOWER"),
"fuel efficiency.\n")
} else {
cat("Decision: FAIL TO REJECT H₀ (p ≥ 0.05)\n")
cat("Conclusion: No significant difference in MPG between transmission types.\n")
}
## Decision: REJECT H₀ (p < 0.05)
## Conclusion: Manual and automatic transmissions have significantly different MPG.
## Manual transmission cars have HIGHER fuel efficiency.
# Visualize
boxplot(mpg ~ am, data = mtcars,
names = c("Automatic", "Manual"),
main = "Fuel Efficiency by Transmission Type",
ylab = "Miles per Gallon (MPG)",
col = c("lightcoral", "lightgreen"),
border = c("darkred", "darkgreen"))
points(c(1, 2), c(mean(automatic), mean(manual)),
pch = 19, col = "blue", cex = 2)
legend("bottomright", legend = "Mean", pch = 19, col = "blue", cex = 1.2)
Research Question: Does a new teaching method improve test scores?
Hypotheses:
- H₀: μ_difference = 0 (no change in scores)
- H₁: μ_difference > 0 (scores increased) [one-tailed]
# Paired t-test
# Create example data: before and after scores
set.seed(seeds[3]) # Third seed from the seedhash list generated above
n_students <- 25
before_scores <- rnorm(n_students, mean = 70, sd = 10)
# Add improvement effect
improvement <- rnorm(n_students, mean = 5, sd = 8)
after_scores <- before_scores + improvement
student_data <- data.frame(
Student = 1:n_students,
Before = round(before_scores, 1),
After = round(after_scores, 1),
Difference = round(after_scores - before_scores, 1)
)
cat("=== PAIRED T-TEST (BEFORE-AFTER DESIGN) ===\n\n")
## === PAIRED T-TEST (BEFORE-AFTER DESIGN) ===
## Research Question: Did the new teaching method improve test scores?
## H₀: μ_difference = 0 (no improvement)
## H₁: μ_difference > 0 (scores improved) [one-tailed]
## Sample Data (first 6 students):
| Student | Before | After | Difference |
|---|---|---|---|
| 1 | 85.0 | 84.3 | -0.7 |
| 2 | 54.8 | 55.0 | 0.2 |
| 3 | 67.6 | 86.9 | 19.3 |
| 4 | 59.0 | 64.8 | 5.8 |
| 5 | 77.2 | 90.9 | 13.7 |
| 6 | 72.7 | 81.5 | 8.9 |
# Perform paired t-test (one-tailed)
result_paired <- t.test(student_data$After, student_data$Before,
paired = TRUE,
alternative = "greater")
cat("\n")
result_paired
##
## Paired t-test
##
## data: student_data$After and student_data$Before
## t = 3.1343, df = 24, p-value = 0.002251
## alternative hypothesis: true mean difference is greater than 0
## 95 percent confidence interval:
## 2.44324 Inf
## sample estimates:
## mean difference
## 5.38
##
## === DESCRIPTIVE STATISTICS ===
cat("Before: Mean =", round(mean(student_data$Before), 2),
", SD =", round(sd(student_data$Before), 2), "\n")## Before: Mean = 70.15 , SD = 11.15
cat("After: Mean =", round(mean(student_data$After), 2),
", SD =", round(sd(student_data$After), 2), "\n")## After: Mean = 75.53 , SD = 14.87
## Mean Difference: 5.38 points
## === INTERPRETATION ===
## t-statistic: 3.134
## p-value (one-tailed): 0.002251
if(result_paired$p.value < 0.05) {
cat("Decision: REJECT H₀ (p < 0.05)\n")
cat("Conclusion: The teaching method significantly improved test scores.\n")
} else {
cat("Decision: FAIL TO REJECT H₀ (p ≥ 0.05)\n")
cat("Conclusion: No significant evidence of improvement.\n")
}## Decision: REJECT H₀ (p < 0.05)
## Conclusion: The teaching method significantly improved test scores.
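The t-test establishes that the improvement is significant, but not how large it is. For a paired design, a common effect-size measure is Cohen's d_z: the mean difference divided by the standard deviation of the differences. A minimal sketch, here using only the six differences shown in the sample table above rather than all 25 students, so the value is illustrative:

```r
# Cohen's d_z for a paired design: mean(diff) / sd(diff)
# 'diffs' stands in for student_data$Difference (first six rows shown earlier)
diffs <- c(-0.7, 0.2, 19.3, 5.8, 13.7, 8.9)
d_z <- mean(diffs) / sd(diffs)
cat("Cohen's d_z:", round(d_z, 2), "\n")
```

Values of d_z around 0.2, 0.5, and 0.8 are conventionally read as small, medium, and large, the same benchmarks used for the independent-samples d below.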
# Visualize
par(mfrow = c(1, 2))
# Box plot comparison
boxplot(student_data$Before, student_data$After,
names = c("Before", "After"),
main = "Score Comparison",
ylab = "Test Score",
col = c("lightcoral", "lightgreen"))
# Difference histogram
hist(student_data$Difference,
main = "Score Changes",
xlab = "Difference (After - Before)",
col = "lightblue",
border = "white")
abline(v = 0, col = "red", lwd = 3, lty = 2)
abline(v = mean(student_data$Difference), col = "blue", lwd = 3)
legend("topright",
legend = c("No Change", "Mean Change"),
col = c("red", "blue"),
lty = c(2, 1),
lwd = 3)
Research Question: Is there a relationship between transmission type and number of cylinders?
## === CHI-SQUARE TEST OF INDEPENDENCE ===
## Research Question: Are transmission type and cylinder count independent?
## H₀: Transmission and cylinders are independent (no relationship)
## H₁: Transmission and cylinders are related
# Create contingency table
contingency_table <- table(mtcars$am, mtcars$cyl)
rownames(contingency_table) <- c("Automatic", "Manual")
colnames(contingency_table) <- paste(c(4, 6, 8), "cyl")
cat("Contingency Table:\n")## Contingency Table:
##
## 4 cyl 6 cyl 8 cyl
## Automatic 3 4 12
## Manual 8 3 2
# Perform chi-square test (result_chisq is used below)
result_chisq <- chisq.test(contingency_table)
print(result_chisq)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 8.7407, df = 2, p-value = 0.01265
##
## === INTERPRETATION ===
## Chi-square statistic: 8.741
## p-value: 0.01265
if(result_chisq$p.value < 0.05) {
cat("Decision: REJECT H₀ (p < 0.05)\n")
cat("Conclusion: There is a significant relationship between transmission and cylinders.\n")
} else {
cat("Decision: FAIL TO REJECT H₀ (p ≥ 0.05)\n")
cat("Conclusion: No significant relationship detected.\n")
}## Decision: REJECT H₀ (p < 0.05)
## Conclusion: There is a significant relationship between transmission and cylinders.
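To see which cells drive the significant result, compare the observed counts with the expected counts under independence (row total × column total / grand total), which `chisq.test()` returns in its `expected` component. A sketch rebuilding the same contingency table so the snippet stands alone:

```r
# Expected cell counts under H0: independence of transmission and cylinders
observed <- matrix(c(3, 8, 4, 3, 12, 2), nrow = 2,
                   dimnames = list(c("Automatic", "Manual"),
                                   c("4 cyl", "6 cyl", "8 cyl")))
test <- chisq.test(observed)        # warns because some expected counts are < 5
round(test$expected, 2)             # cells far from expected drive the statistic
```

For example, manual cars with 8 cylinders are observed twice but expected about 5.7 times under independence, a large shortfall that contributes heavily to the chi-square statistic. (The small-expected-count warning is also a reminder that Fisher's exact test can be a safer choice for tables like this.)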
# Visualize
barplot(contingency_table,
beside = TRUE,
legend = TRUE,
main = "Transmission Type by Cylinder Count",
xlab = "Number of Cylinders",
ylab = "Frequency",
col = c("lightcoral", "lightgreen"))
Research Question: Is there a linear relationship between car weight and fuel efficiency?
## === CORRELATION TEST ===
## Research Question: Is there a correlation between weight and MPG?
## H₀: ρ = 0 (no correlation)
## H₁: ρ ≠ 0 (correlation exists)
# Perform correlation test (result_cor is used below)
result_cor <- cor.test(mtcars$wt, mtcars$mpg)
print(result_cor)
##
## Pearson's product-moment correlation
##
## data: mtcars$wt and mtcars$mpg
## t = -9.559, df = 30, p-value = 1.294e-10
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9338264 -0.7440872
## sample estimates:
## cor
## -0.8676594
##
## === INTERPRETATION ===
## Correlation coefficient (r): -0.868
## t-statistic: -9.559
## p-value: 1.294e-10
if(result_cor$p.value < 0.05) {
cat("Decision: REJECT H₀ (p < 0.05)\n")
cat("Conclusion: There is a significant",
ifelse(result_cor$estimate < 0, "NEGATIVE", "POSITIVE"),
"correlation.\n")
cat("Interpretation:", abs(round(result_cor$estimate, 3)),
"indicates a",
ifelse(abs(result_cor$estimate) > 0.7, "strong",
ifelse(abs(result_cor$estimate) > 0.4, "moderate", "weak")),
"relationship.\n")
} else {
cat("Decision: FAIL TO REJECT H₀ (p ≥ 0.05)\n")
cat("Conclusion: No significant correlation detected.\n")
}## Decision: REJECT H₀ (p < 0.05)
## Conclusion: There is a significant NEGATIVE correlation.
## Interpretation: 0.868 indicates a strong relationship.
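Squaring the correlation coefficient gives the coefficient of determination, r², the share of variance in MPG linearly accounted for by weight. A one-line check (recomputing r here so the snippet stands alone):

```r
# r^2: proportion of variance in mpg linearly explained by wt
r <- cor(mtcars$wt, mtcars$mpg)
cat("r =", round(r, 3), " r^2 =", round(r^2, 3), "\n")  # roughly 75% of variance
```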
# Visualize
plot(mtcars$wt, mtcars$mpg,
main = paste("Weight vs MPG (r =", round(result_cor$estimate, 3), ")"),
xlab = "Weight (1000 lbs)",
ylab = "Miles per Gallon",
pch = 19,
col = "steelblue")
abline(lm(mpg ~ wt, data = mtcars), col = "red", lwd = 2)
Effect size measures the magnitude of a difference or relationship, independent of sample size.
Field et al. (2012, ch. 11) argue that reporting effect sizes alongside p-values answers “How big is the effect?” and prevents us from celebrating trivial differences that become significant only because of large samples.
Common Effect Size Measures:
- Cohen’s d: for t-tests (small: 0.2, medium: 0.5, large: 0.8)
- r: for correlations (small: 0.1, medium: 0.3, large: 0.5)
- η² (eta-squared): for ANOVA
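These measures are interconvertible: for a t-test, an effect-size r can be recovered from the t statistic via r = √(t² / (t² + df)), a standard conversion discussed by Field et al. (2012). A hedged sketch applying it to the transmission comparison used earlier in this tutorial:

```r
# Convert a t statistic to an effect-size r: r = sqrt(t^2 / (t^2 + df))
tt <- t.test(mpg ~ am, data = mtcars)   # Welch two-sample test, as above
r_effect <- unname(sqrt(tt$statistic^2 / (tt$statistic^2 + tt$parameter)))
cat("effect-size r:", round(r_effect, 3), "\n")  # about 0.66 - a large effect
```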
# Calculate Cohen's d for transmission comparison
# Install effsize if needed
if (!require("effsize", quietly = TRUE)) {
install.packages("effsize", repos = "https://cran.r-project.org")
library(effsize)
}
cat("=== EFFECT SIZE ANALYSIS ===\n\n")## === EFFECT SIZE ANALYSIS ===
# Cohen's d for transmission effect on MPG
cohens_d <- cohen.d(mtcars$mpg ~ factor(mtcars$am))
print(cohens_d)##
## Cohen's d
##
## d estimate: -1.477947 (large)
## 95 percent confidence interval:
## lower upper
## -2.304209 -0.651685
##
## === INTERPRETATION ===
## Cohen's d: -1.478
## Magnitude: large
## Effect Size Guidelines (Cohen, 1988):
## Small: d = 0.2
## Medium: d = 0.5
## Large: d = 0.8
## Practical Significance:
if(abs(cohens_d$estimate) < 0.2) {
cat("The effect is negligible - may not be practically important.\n")
} else if(abs(cohens_d$estimate) < 0.5) {
cat("The effect is small - some practical importance.\n")
} else if(abs(cohens_d$estimate) < 0.8) {
cat("The effect is medium - likely practically important.\n")
} else {
cat("The effect is large - very practically important.\n")
}## The effect is large - very practically important.
| Test | Purpose | Data_Type | R_Function |
|---|---|---|---|
| One-Sample t-test | Compare sample mean to known value | Continuous | t.test(x, mu=value) |
| Independent t-test | Compare means of two independent groups | Continuous | t.test(x, y) |
| Paired t-test | Compare means of paired/related observations | Continuous | t.test(x, y, paired=TRUE) |
| Chi-Square Test | Test relationship between categorical variables | Categorical | chisq.test(table) |
| Correlation Test | Test linear relationship between continuous variables | Continuous | cor.test(x, y) |
| ANOVA | Compare means of 3+ groups | Continuous | aov() or anova() |
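Every test in the table except ANOVA is demonstrated above. As a minimal sketch of the last row, a one-way ANOVA comparing MPG across the three cylinder groups in mtcars (the Tukey follow-up identifies which pairs of groups differ):

```r
# One-way ANOVA: does mean MPG differ across 4-, 6-, and 8-cylinder cars?
anova_fit <- aov(mpg ~ factor(cyl), data = mtcars)
summary(anova_fit)   # F test for any difference among the three group means
TukeyHSD(anova_fit)  # pairwise comparisons with multiplicity adjustment
```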
## === AIR QUALITY DATASET ===
## Dimensions: 153 6
## Variables: Ozone Solar.R Wind Temp Month Day
## Ozone Solar.R Wind Temp
## Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
## 1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
## Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
## Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
## 3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
## Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
## NA's :37 NA's :7
## Month Day
## Min. :5.000 Min. : 1.0
## 1st Qu.:6.000 1st Qu.: 8.0
## Median :7.000 Median :16.0
## Mean :6.993 Mean :15.8
## 3rd Qu.:8.000 3rd Qu.:23.0
## Max. :9.000 Max. :31.0
##
##
## Missing values by variable:
## Ozone Solar.R Wind Temp Month Day
## 37 7 0 0 0 0
# Visualize relationships
ggplot(airquality, aes(x = Temp, y = Ozone)) +
geom_point(aes(color = Month), alpha = 0.6, size = 3) +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(
title = "Ozone Levels vs Temperature",
subtitle = "New York Air Quality Measurements (1973)",
x = "Temperature (°F)",
y = "Ozone (ppb)"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
## === MOTOR TRENDS ANALYSIS ===
# Descriptive statistics by transmission type
mtcars_summary <- mtcars %>%
mutate(Transmission = factor(am, labels = c("Automatic", "Manual"))) %>%
group_by(Transmission) %>%
summarise(
N = n(),
Mean_MPG = mean(mpg),
SD_MPG = sd(mpg),
Mean_HP = mean(hp),
SD_HP = sd(hp),
Mean_Weight = mean(wt)
) %>%
mutate(across(where(is.numeric), ~round(., 2)))
mtcars_summary %>%
kable(caption = "Car Performance by Transmission Type") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Transmission | N | Mean_MPG | SD_MPG | Mean_HP | SD_HP | Mean_Weight |
|---|---|---|---|---|---|---|
| Automatic | 19 | 17.15 | 3.83 | 160.26 | 53.91 | 3.77 |
| Manual | 13 | 24.39 | 6.17 | 126.85 | 84.06 | 2.41 |
# Visualize MPG distribution
ggplot(mtcars, aes(x = factor(am, labels = c("Automatic", "Manual")), y = mpg)) +
geom_boxplot(aes(fill = factor(am)), alpha = 0.7) +
geom_jitter(width = 0.2, alpha = 0.5) +
labs(
title = "Fuel Efficiency by Transmission Type",
x = "Transmission",
y = "Miles per Gallon (MPG)"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none"
) +
scale_fill_brewer(palette = "Set2")
# Demonstrate population vs sample concepts
# Use seedhash for reproducible sampling
set.seed(seeds[2]) # Use second seed from our generator
# Population: All iris flowers
population_mean <- mean(iris$Sepal.Length)
population_sd <- sd(iris$Sepal.Length)
cat("=== POPULATION STATISTICS ===\n")## === POPULATION STATISTICS ===
## Population Size: 150
## Population Mean: 5.843
## Population SD: 0.828
# Take multiple samples
sample_sizes <- c(10, 30, 50)
sample_results <- list()
for(n in sample_sizes) {
sample_means <- replicate(1000, mean(sample(iris$Sepal.Length, n)))
sample_results[[as.character(n)]] <- data.frame(
Sample_Size = n,
Mean_of_Means = mean(sample_means),
SD_of_Means = sd(sample_means),
Min = min(sample_means),
Max = max(sample_means)
)
}
sample_comparison <- bind_rows(sample_results) %>%
mutate(across(where(is.numeric) & !Sample_Size, ~round(., 4)))
cat("=== SAMPLING DISTRIBUTION ===\n")## === SAMPLING DISTRIBUTION ===
## Based on 1000 samples for each size
sample_comparison %>%
kable(caption = "Effect of Sample Size on Sampling Distribution") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Sample_Size | Mean_of_Means | SD_of_Means | Min | Max |
|---|---|---|---|---|
| 10 | 5.8449 | 0.2560 | 4.9900 | 6.710 |
| 30 | 5.8446 | 0.1337 | 5.4267 | 6.320 |
| 50 | 5.8371 | 0.0930 | 5.5200 | 6.112 |
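The shrinking SD_of_Means column follows the central limit theorem: the standard error of the mean is σ/√n. Because `sample()` here draws without replacement from a finite population of 150 flowers, the finite population correction √((N−n)/(N−1)) brings the theoretical values even closer to the simulated ones (a hedged check; the simulated numbers vary with the seed):

```r
# Theoretical SE of the mean, with and without the finite population
# correction, for comparison against the simulated SD_of_Means column
sigma <- sd(iris$Sepal.Length)
N <- nrow(iris)
for (n in c(10, 30, 50)) {
  se  <- sigma / sqrt(n)
  fpc <- sqrt((N - n) / (N - 1))
  cat("n =", n, " SE =", round(se, 4), " SE * FPC =", round(se * fpc, 4), "\n")
}
```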
# Visualize sampling distributions
par(mfrow = c(1, 3))
for(n in sample_sizes) {
sample_means <- replicate(1000, mean(sample(iris$Sepal.Length, n)))
hist(sample_means,
main = paste("Sample Size =", n),
xlab = "Sample Mean",
col = "lightblue",
border = "white",
xlim = c(4.9, 6.8))  # wide enough for the n = 10 means (min 4.99, max 6.71)
abline(v = population_mean, col = "red", lwd = 3, lty = 2)
}
Central Tendency:
- Mean: average (sensitive to outliers)
- Median: middle value (robust to outliers)
- Mode: most frequent value
Dispersion:
- Range: Max - Min (sensitive to extremes)
- IQR: Q3 - Q1 (robust measure)
- Standard Deviation: average deviation from the mean
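A tiny demonstration of the robustness claims above, using a made-up vector: one extreme value drags the mean and range substantially, while the median and IQR barely move.

```r
# Outlier sensitivity: mean and range shift, median and IQR barely change
x <- c(10, 12, 13, 14, 15)
x_out <- c(x, 100)  # add a single extreme value
cat("mean:  ", mean(x), "->", round(mean(x_out), 2), "\n")
cat("median:", median(x), "->", median(x_out), "\n")
cat("range: ", diff(range(x)), "->", diff(range(x_out)), "\n")
cat("IQR:   ", IQR(x), "->", IQR(x_out), "\n")
```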
# ===== DATA EXPLORATION =====
# Load data
data(dataset_name)
head(df) # First 6 rows
tail(df) # Last 6 rows
str(df) # Structure
summary(df) # Summary statistics
dim(df) # Dimensions
names(df) # Variable names
# ===== DESCRIPTIVE STATISTICS =====
mean(x) # Mean
median(x) # Median
sd(x) # Standard deviation
var(x) # Variance
range(x) # Range
IQR(x) # Interquartile range
quantile(x) # Quartiles
table(x) # Frequency table
# ===== VISUALIZATION =====
hist(x) # Histogram
boxplot(x) # Boxplot
plot(x, y) # Scatterplot
barplot(table(x)) # Bar plot
pairs(df) # Scatterplot matrix
# ===== TIDYVERSE =====
df %>%
filter(condition) %>% # Filter rows
select(var1, var2) %>% # Select columns
mutate(new_var = ...) %>% # Create new variables
group_by(category) %>% # Group data
summarise(mean = mean(x)) # Summarize
# ===== MODELING =====
model <- lm(y ~ x, data = df) # Linear regression
summary(model) # Model summary
predict(model, newdata = new_df) # Predictions
plot(model) # Diagnostic plots
# ===== TESTING =====
t.test(x, y) # T-test
cor.test(x, y) # Correlation test
chisq.test(x, y) # Chi-square test
shapiro.test(x) # Normality test
# ❌ BAD: Not checking for missing values
mean(data$variable)
# ✅ GOOD: Handle missing values
mean(data$variable, na.rm = TRUE)
# ❌ BAD: Assuming normality without checking
t.test(group1, group2)
# ✅ GOOD: Check normality first
shapiro.test(group1)
hist(group1)
# Then decide on appropriate test
# ❌ BAD: Not setting seed for reproducibility
sample(data, 10)
# ✅ GOOD: Set seed for reproducible results
set.seed(123)
sample(data, 10)
# ✅ BEST: Use seedhash for reproducible, named seeds
library(seedhash)
gen <- SeedHashGenerator$new("YourName")
seeds <- gen$generate_seeds(5)
set.seed(seeds[1])
sample(data, 10)
# ❌ BAD: Ignoring assumptions
model <- lm(y ~ x)
# Use model without checking
# ✅ GOOD: Check assumptions
model <- lm(y ~ x)
plot(model) # Check diagnostic plots
# Data Manipulation
install.packages("dplyr") # Data wrangling
install.packages("tidyr") # Data tidying
install.packages("data.table") # Fast data manipulation
# Visualization
install.packages("ggplot2") # Advanced plotting
install.packages("plotly") # Interactive plots
install.packages("ggpubr") # Publication-ready plots
# Statistics
install.packages("moments") # Skewness & kurtosis
install.packages("car") # Regression diagnostics
install.packages("psych") # Psychological statistics
# Reproducibility
install.packages("seedhash") # Reproducible seed generation
# Tables
install.packages("knitr") # Basic tables
install.packages("kableExtra") # Enhanced tables
install.packages("gt") # Grammar of tables
# ===== GETTING HELP =====
?function_name # Open the help page
help(function_name) # Same as ?function_name
Using the mtcars dataset:
1. Compute descriptive statistics for mpg, cyl (number of cylinders), hp (horsepower), and wt (weight)
2. Create a histogram of mpg with appropriate labels
3. Compare mpg across different cyl values
4. Create a scatterplot of wt vs mpg with a regression line
## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] effsize_0.8.1 plotly_4.11.0 moments_0.14.1 kableExtra_1.4.0
## [5] knitr_1.50 lubridate_1.9.4 forcats_1.0.1 stringr_1.5.2
## [9] dplyr_1.1.4 purrr_1.1.0 readr_2.1.5 tidyr_1.3.1
## [13] tibble_3.3.0 ggplot2_4.0.0 tidyverse_2.0.0 seedhash_0.1.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.4 xml2_1.4.1 lattice_0.22-7
## [5] stringi_1.8.7 hms_1.1.4 digest_0.6.37 magrittr_2.0.4
## [9] evaluate_1.0.5 grid_4.5.1 timechange_0.3.0 RColorBrewer_1.1-3
## [13] fastmap_1.2.0 Matrix_1.7-4 jsonlite_2.0.0 httr_1.4.7
## [17] mgcv_1.9-3 crosstalk_1.2.2 viridisLite_0.4.2 scales_1.4.0
## [21] lazyeval_0.2.2 textshaping_1.0.4 jquerylib_0.1.4 cli_3.6.5
## [25] rlang_1.1.6 splines_4.5.1 withr_3.0.2 cachem_1.1.0
## [29] yaml_2.3.10 tools_4.5.1 tzdb_0.5.0 vctrs_0.6.5
## [33] R6_2.6.1 lifecycle_1.0.4 htmlwidgets_1.6.4 pkgconfig_2.0.3
## [37] pillar_1.11.1 bslib_0.9.0 gtable_0.3.6 data.table_1.17.8
## [41] glue_1.8.0 systemfonts_1.3.1 xfun_0.54 tidyselect_1.2.1
## [45] rstudioapi_0.17.1 farver_2.1.2 nlme_3.1-168 htmltools_0.5.8.1
## [49] labeling_0.4.3 rmarkdown_2.30 svglite_2.2.2 compiler_4.5.1
## [53] S7_0.2.0
End of Week 02: R for Data Analytics Tutorial
ANLY 500 - Analytics I
Harrisburg University